High-Performance Parallel Programming: A Hands-On Course: Hardware Bottlenecks: Memory and Resource Limits

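Modern high-performance computing faces a fundamental "memory wall": computational throughput (floating-point operations per second, FLOPS) has grown explosively, far outpacing the modest gains in global memory bandwidth. This gap turns large many-core arrays into "starved" processors that sit idle waiting for data to arrive.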
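
To put rough numbers on this gap, the sketch below estimates how many floating-point operations a device could perform in the time it takes to move one byte, and what that implies for a simple kernel. The peak figures (`PEAK_FLOPS`, `PEAK_BANDWIDTH`) are hypothetical values chosen for illustration, not measurements of any real GPU, and the function names are likewise made up for this example.

```python
# Hypothetical device figures, chosen only for illustration.
PEAK_FLOPS = 10e12       # 10 TFLOP/s peak compute (assumed)
PEAK_BANDWIDTH = 500e9   # 500 GB/s global memory bandwidth (assumed)


def machine_balance(peak_flops, peak_bandwidth):
    """FLOPs the device can execute in the time it takes to move one byte."""
    return peak_flops / peak_bandwidth


def attainable_flops(flops_per_byte, peak_flops, peak_bandwidth):
    """Upper bound on throughput for a kernel doing `flops_per_byte` work per byte moved."""
    return min(peak_flops, flops_per_byte * peak_bandwidth)


# float32 vector add c[i] = a[i] + b[i]:
# 1 FLOP per element, 12 bytes moved per element (two loads, one store).
vector_add_intensity = 1 / 12

balance = machine_balance(PEAK_FLOPS, PEAK_BANDWIDTH)
bound = attainable_flops(vector_add_intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
utilization = bound / PEAK_FLOPS

print(f"Machine balance: {balance:.1f} FLOPs per byte")
print(f"Vector add upper bound: {bound / 1e9:.1f} GFLOP/s")
print(f"Fraction of peak compute: {utilization:.2%}")
```

Under these assumed figures, a kernel needs about 20 FLOPs of work per byte fetched before compute rather than memory becomes the limit. The vector add falls far short of that, so most of the arithmetic units would sit idle.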

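Although a GPU can execute trillions of operations per second, the physical path to memory (DRAM) is constrained by pin density and power requirements. Memory is the primary limiting factor for parallelism: as the number of concurrent threads grows, the bandwidth available to each thread shrinks, and the hardware stalls instead of doing useful work.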
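
The shrinking per-thread share can be shown with simple division. This sketch assumes the same hypothetical 500 GB/s bus as an illustrative figure and an even split of bandwidth across threads, which real hardware does not guarantee.

```python
PEAK_BANDWIDTH = 500e9  # bytes per second (assumed, illustrative)


def bandwidth_per_thread(total_bandwidth, num_threads):
    """Even share of memory bandwidth if all threads issue requests at once."""
    return total_bandwidth / num_threads


for n in (1_000, 10_000, 100_000):
    share = bandwidth_per_thread(PEAK_BANDWIDTH, n)
    print(f"{n:>7} threads -> {share / 1e6:.1f} MB/s per thread")
```

Each tenfold increase in concurrent threads cuts the per-thread share tenfold, because the total bandwidth of the bus stays fixed.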

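Imagine a state-of-the-art kitchen (the GPU cores) that can produce 1,000 meals per hour. The ingredients, however, are stored in a warehouse five miles away (global memory), and there is only one delivery scooter (the memory bus). No matter how many chefs are hired, output is capped by the speed of the scooter.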

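Typical multi-core CPU systems use large caches to hide latency for a small number of heavyweight threads. Massively parallel architectures, by contrast, face constant "traffic congestion" from huge numbers of concurrent requests. Resource limits at the register and shared-memory level determine the maximum degree of parallelism (occupancy) the hardware can sustain before it is oversubscribed.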
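
The effect of register and shared-memory limits on occupancy can be approximated with integer division. The per-SM limits below are hypothetical values for illustration, and the model is deliberately simplified: it ignores allocation granularity and any cap on blocks per SM, both of which real hardware imposes.

```python
# Hypothetical per-SM resource limits (assumptions, not a specific GPU).
REGISTERS_PER_SM = 65_536
SHARED_MEM_PER_SM = 64 * 1024   # bytes
MAX_THREADS_PER_SM = 2_048


def max_resident_blocks(threads_per_block, regs_per_thread, smem_per_block):
    """Blocks that fit on one SM under the tightest of three resource limits."""
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    by_regs = REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    # A kernel that uses no shared memory is not limited by it.
    by_smem = SHARED_MEM_PER_SM // smem_per_block if smem_per_block else by_threads
    return min(by_threads, by_regs, by_smem)


def occupancy(threads_per_block, regs_per_thread, smem_per_block=0):
    """Resident threads as a fraction of the SM's thread capacity."""
    blocks = max_resident_blocks(threads_per_block, regs_per_thread, smem_per_block)
    return blocks * threads_per_block / MAX_THREADS_PER_SM


print(occupancy(256, 32))             # limited by the thread capacity
print(occupancy(256, 64))             # limited by registers
print(occupancy(256, 32, 16 * 1024))  # limited by shared memory
```

In this model, doubling register use per thread from 32 to 64 halves the number of resident blocks, and so does requesting 16 KiB of shared memory per block.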

QUESTION 1

What is the primary cause of the 'Memory Wall' in modern GPU computing?

The clock speed of cores is too slow to process DRAM data.

Computational throughput (FLOPS) has increased much faster than memory bandwidth.

Shared memory is too large for the hardware to manage.

Global memory has higher latency than CPU registers.

QUESTION 2

In the 'Kitchen Analogy,' what does the delivery scooter represent?

The GPU Core/Chef.

The Register File.

The Global Memory Bus.

The Operating System Scheduler.

QUESTION 3

How do resource limitations like register count affect parallelism?

They increase the speed of each individual thread.

They limit occupancy by reducing the number of active threads that can reside on an SM.

They have no effect on throughput, only on power consumption.

They bypass the need for global memory access.

QUESTION 4

When a kernel is in the 'Memory Bound' region of the Roofline Model, what is the best way to improve performance?

Increase the number of floating-point operations per second.

Increase the arithmetic intensity (data reuse).

Decrease the number of threads per block.

Add more complex branching logic.

QUESTION 5

Why is implicit synchronization unreliable in massively parallel architectures?

Hardware evolution means threads within a warp may not stay locked in SIMT fashion.

Shared memory is too fast for synchronization to matter.

Global memory access is always synchronous.

Threads are processed sequentially in blocks.